In this project, I am going to explore a dataset that contains information on the quality of red wines and their chemical properties. Using the statistical program R, I will first conduct a preliminary investigation to see whether there are any relationships among the variables, and then illustrate them with plots. The dataset is available for download here, and its documentation is available here.
Let’s run some basic functions to take a first look at the dataset.
# Check the general structure of the dataset
str(df)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
# Double-check whether there are any duplicates
anyDuplicated(df)
## [1] 0
# A glimpse of the statistical summary of each variable
summary(df)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Some findings at first glance:
- X seems to be the ID of each wine
- quality here is supposed to be a categorical variable
- Apart from quality and the .sulfur.dioxide variables, all other variables are continuous
Let’s first make a set of histograms for all the variables to get a general idea of their distributions.
Looking at the histogram of wine quality, we can see that ratings fall on a discrete scale, that most wines receive roughly average ratings, and that there are no real extreme cases (outliers). Although it is not obvious, the distribution looks roughly normal.
Based on this distribution, let’s create another variable from the ratings. Three categories, ‘good’, ‘average’, and ‘bad’, will represent wines that receive a rating of 7 or above, 5 or 6, and below 5 respectively.
##
## bad average good
## 63 1319 217
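The binning described above can be sketched with base R's cut(). This is a minimal illustration on a small made-up vector of quality scores, not the original code; the name rating and the break points are assumptions derived from the description (below 5, 5 or 6, 7 or above).

```r
# Toy quality scores (made up for illustration; the real data has 1599 rows)
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Intervals are right-closed by default:
# (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf] -> good
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"))

# Count the wines in each category
table(rating)
```

Because quality only takes integer values, breaking at 4 and 6 is equivalent to "below 5" and "7 or above".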
Acids are major wine constituents and contribute greatly to a wine's taste. According to the documentation and some online research, fixed.acidity and volatile.acidity measure two different types of acid (tartaric and acetic), so let’s create another variable that stands for the overall acidity of a wine.
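The text does not show how the overall acidity variable is built. One plausible construction, assumed here purely for illustration, is the plain sum of the three acid columns. The rows below are the first three value sets from the str() output.

```r
# First three rows of the acid columns, copied from the str() output
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76),
  citric.acid      = c(0.00, 0.00, 0.04)
)

# Assumption: overall acidity is the sum of the three acid measurements
df$overall.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid
df$overall.acidity
```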
Although by looking at the above histograms, we can tell the general properties that wines have, we cannot really tell what separates the good ones from the bad ones.
We need to perform a comparison in order to tell the difference. Let’s first compare the overall acidity between the good ones and bad ones.
I believe there are reasons why wine experts divide acidity into two groups; so let’s dig deeper by plotting the acids individually along with other variables.
Here, we plot the probability density functions of the good wines and the bad wines. Generally speaking, we want to focus on the areas where the two groups do not overlap, because those areas show what separates the good ones from the bad ones. In other words, variables for which the two groups differ clearly in distribution could be indicators of wine quality, or at least help us predict it.
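A comparison of this kind can be sketched with base R's density(). The values below are synthetic draws standing in for one variable in the two groups, so the picture only illustrates the technique, not the wine data. The original plots may well have been made with a different plotting package.

```r
set.seed(1)
# Synthetic values for two groups (made up; not the wine data)
good <- rnorm(200, mean = 11.5, sd = 1)
bad  <- rnorm(200, mean = 10.2, sd = 1)

# Kernel density estimate for each group
d_good <- density(good)
d_bad  <- density(bad)

# Overlay both curves on common axes
plot(d_good,
     xlim = range(d_good$x, d_bad$x),
     ylim = c(0, max(d_good$y, d_bad$y)),
     main = "Density by rating (synthetic data)", xlab = "value")
lines(d_bad, lty = 2)
legend("topright", legend = c("good", "bad"), lty = c(1, 2))
```

The region where one curve is high and the other is near zero is the non-overlapping area the paragraph refers to.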
Looking at the above plots, we can see relatively obvious differences in volatile.acidity, citric.acid, pH, sulphates, and alcohol. If we look at the plots individually, we can see that the peaks of the two groups lie especially far apart for citric.acid and alcohol (each on its own scale). This suggests that these two variables might serve as indicators of wine quality.
As we saw from the plots above, the distributions of some variables are skewed. This suggests that there might be outliers in the dataset. Let’s check this with boxplots.
Now we have a much clearer picture of the outliers.
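The points a boxplot draws as outliers can also be listed directly with boxplot.stats(), which flags values more than 1.5 times the box height beyond the hinges. A minimal sketch on the ten residual.sugar values shown in the str() output:

```r
# First ten residual.sugar values, copied from the str() output
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# $out holds the values lying beyond the whiskers (default coef = 1.5)
out <- boxplot.stats(residual.sugar)$out
out

# The same rule drives the points drawn by boxplot()
boxplot(residual.sugar, main = "residual.sugar (first 10 rows)")
```

On these ten values only 6.1 falls outside the whiskers.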
As mentioned in the beginning, there may be multicollinearity among variables; so let’s get a quick snapshot of the correlations among them.
Indeed, multicollinearity exists in this dataset. Basically, the darker the cell in the grid, the stronger the correlation between the corresponding pair of variables. One interesting finding is that there seems to be a negative correlation between citric.acid and volatile.acidity, and that volatile.acidity seems to have a slightly positive relationship with pH. The latter is actually quite counterintuitive, since a more acidic wine should have a lower pH.
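The numbers behind such a grid come from a correlation matrix, which cor() computes in one call. The sketch below uses only the first ten rows of three columns (copied from the str() output), so the coefficients will differ from those of the full dataset.

```r
# First ten rows of three columns, copied from the str() output
df <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations (the default method)
cor_mat <- cor(df)
round(cor_mat, 2)
```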
From the density plots, we can see that some variables may have an impact on wine quality. Let’s see if we can capture the trend with boxplots.
Now the relationships between certain variables and wine quality are much clearer. In general, the more the boxes shift from one quality level to the next, the greater the apparent impact of that variable on wine quality. Here, fixed.acidity, volatile.acidity, pH, citric.acid, sulphates, and alcohol all seem to have an impact on wine quality. This is especially true for alcohol, sulphates, volatile.acidity, and citric.acid. Let’s double-check by calculating the correlation coefficient of each variable against quality.
## [,1]
## fixed.acidity 0.12405165
## volatile.acidity -0.39055778
## citric.acid 0.22637251
## residual.sugar 0.01373164
## chlorides -0.12890656
## free.sulfur.dioxide -0.05065606
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## pH -0.05773139
## sulphates 0.25139708
## alcohol 0.47616632
## overall.acidity 0.10375373
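A one-column table like this can be produced by passing the predictor columns as x and quality as y to cor(). The sketch below assumes that approach and runs on the first ten rows of a few columns from the str() output, so its numbers will not match the full-data values above.

```r
# First ten rows of selected columns, copied from the str() output
df <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Every column except the response
predictors <- setdiff(names(df), "quality")

# One row per predictor, a single unnamed column for quality
cor_with_quality <- cor(df[predictors], df$quality)
round(cor_with_quality, 3)
```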
To summarize:
As mentioned previously, although good wines generally have relatively high acidity, they also tend to have a lower level of volatile.acidity. According to the documentation, volatile.acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. It seems volatile.acidity is negatively correlated with the other acids.
Let’s plot volatile.acidity against citric.acid and fixed.acidity.
##
## Pearson's product-moment correlation
##
## data: df$volatile.acidity and df$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
##
## Pearson's product-moment correlation
##
## data: df$volatile.acidity and df$fixed.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3013681 -0.2097433
## sample estimates:
## cor
## -0.2561309
volatile.acidity is indeed negatively correlated with the other acids. This is especially true for citric.acid, given the correlation coefficient of -0.552.
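The output above has the form printed by cor.test(), and its t statistic follows from the correlation alone via t = r * sqrt(n - 2) / sqrt(1 - r^2). As a quick consistency check, the sketch below recomputes t from the reported estimate and degrees of freedom. The variable names are illustrative.

```r
# Values reported in the cor.test output for volatile.acidity vs citric.acid
r        <- -0.5524957   # sample correlation
df_resid <- 1597         # degrees of freedom, n - 2 with n = 1599

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df_resid) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

round(t_stat, 3)
p_value < 2.2e-16
```

The recomputed statistic agrees with the reported t = -26.489.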
According to the documentation, total.sulfur.dioxide includes free.sulfur.dioxide. This implies that there should be a correlation between the two variables.
##
## Pearson's product-moment correlation
##
## data: df$free.sulfur.dioxide and df$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
As expected, there is a fairly strong positive correlation of 0.668 between the two variables.
According to the documentation, density is a variable that depends on alcohol level and sugar content. This suggests that density may be correlated with both residual.sugar and alcohol.
##
## Pearson's product-moment correlation
##
## data: df$residual.sugar and df$density
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3116908 0.3973835
## sample estimates:
## cor
## 0.3552834
##
## Pearson's product-moment correlation
##
## data: df$alcohol and df$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
The results are as expected: residual.sugar is positively correlated with density, and alcohol is negatively correlated with density.
pH should be strongly affected by a wine’s acidity, as pH is essentially a measure of acidity; so let’s plot overall.acidity against pH.
##
## Pearson's product-moment correlation
##
## data: df$overall.acidity and df$pH
## t = -37.418, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7087579 -0.6564574
## sample estimates:
## cor
## -0.6834838
The results are as expected: overall.acidity is negatively correlated with pH.
The last plot seems to be the most linear of all. Such a strong correlation suggests that we might use a linear model to predict pH based on overall.acidity.
According to the boxplot, the residuals are generally consistent across quality levels, except for the lowest one. The wines with quality 3 have a median residual well below 0. Supposedly, acidity itself should be the biggest factor that alters pH. Nonetheless, this does not really hold for this group of wines. Chances are there are omitted variables.
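A fit and residual check of this kind can be sketched with lm() and a boxplot of the residuals grouped by quality. The sketch assumes overall.acidity is the sum of the three acid columns and uses only the first ten rows from the str() output, so it shows the mechanics rather than the actual result (no quality-3 wines appear in these rows).

```r
# First ten rows of the relevant columns, copied from the str() output
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumption: overall acidity is the sum of the three acid measurements
df$overall.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid

# Simple linear model predicting pH from overall acidity
fit <- lm(pH ~ overall.acidity, data = df)
coef(fit)

# Residuals grouped by quality; a box centred away from 0 means the
# model systematically over- or under-predicts pH for that group
df$resid <- residuals(fit)
boxplot(resid ~ factor(quality), data = df,
        xlab = "quality", ylab = "residual (observed - fitted pH)")
abline(h = 0, lty = 2)
```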
Previously, we spotted a number of correlations among variables. Now let’s visualize them by rating and by quality, and see if the patterns are consistent across groups.
The patterns are consistent across groups.
Previously, in the univariate analysis, we found that the peaks of the two groups lie especially far apart for citric.acid and alcohol (each on its own scale). We also found relatively strong correlations between quality and both volatile.acidity and sulphates. Let’s see how the good wines and bad wines are distributed in a scatter plot.
The scatter plots further support our ideas. Although they do not allow a crystal-clear generalization, they give us a pretty clear picture of what separates good wines from bad wines.
This first pair of plots shows how the wine groups are distributed over the variables. A variable is a fairly good indicator when the group distributions do not overlap and move farther apart.
This second pair of plots shows how a drop in volatile.acidity and a rise in sulphates go along with higher wine quality. Although they cannot tell the whole story, the plots are good enough for us to say that these two variables should not be neglected.
This third pair of plots further demonstrates the idea of the first pair. Here, we can clearly see where certain wines are clustered. We can be even more confident in saying that these variables are some of the key factors to consider when evaluating wine.
In this EDA, I was able to identify some of the key factors that have an impact on wine quality. Although wine quality is arguably subjective, the results of this analysis are reasonable. At the least, we can say that the variables we investigated do play a role in wine quality according to conventional industry standards. Generally speaking, acidity, sulphates level, and alcohol level are the ones that could alter wine quality most. More probabilistic work would be needed if we wanted to run hypothesis tests on certain statistics, such as the difference in mean alcohol level among groups.
One thing worth mentioning is that multicollinearity is present in the dataset. Multicollinearity occurs when predictor variables are related to one another. If we want to build a linear model, we ideally want the predictor variables to be related to the response variable but not to one another. In the presence of multicollinearity, expected relationships may not hold, and hypothesis tests on individual coefficients become unreliable. As we saw in the correlation table, quite a few variables are correlated with one another. If we were to build a multiple linear regression model, we would have to deal with this issue. One common way of handling correlated predictors in such a model is simply to remove the variable that is most strongly related to the others. Removing a predictor that is less important is a common choice.
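One way to decide which predictor is most related to the others is the variance inflation factor (VIF), a diagnostic not used in the analysis above and introduced here only as an illustration. Each predictor is regressed on the remaining ones, and VIF = 1 / (1 - R^2). The helper vif_manual is a hypothetical name, and the data are the first ten rows from the str() output, so the values only demonstrate the mechanics.

```r
# First ten rows of three predictors, copied from the str() output
predictors <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

# Hypothetical helper: regress each column on all the others and
# convert the resulting R-squared into a variance inflation factor
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(predictors)
round(vifs, 2)

# The predictor with the largest VIF is the first candidate for removal
names(which.max(vifs))
```

A VIF of 1 means a predictor is uncorrelated with the rest; larger values mean more of its variance is explained by the other predictors.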